16 research outputs found

Cross-lingual alignments of ELMo contextual embeddings

Author: Robnik-Šikonja Marko
Ulčar Matej
Publication date: 22/07/2021

Building machine learning prediction models for a specific NLP task requires sufficient training data, which can be difficult to obtain for less-resourced languages. Cross-lingual embeddings map word embeddings from a less-resourced language to a resource-rich language so that a prediction model trained on data from the resource-rich language can also be used in the less-resourced language. To produce cross-lingual mappings of recent contextual embeddings, anchor points between the embedding spaces have to be words in the same context. We address this issue with a novel method for creating cross-lingual contextual alignment datasets. Based on that, we propose several cross-lingual mapping methods for ELMo embeddings. The proposed linear mapping methods use existing Vecmap and MUSE alignments on contextual ELMo embeddings. Novel nonlinear ELMoGAN mapping methods are based on GANs and do not assume isomorphic embedding spaces. We evaluate the proposed mapping methods on nine languages, using four downstream tasks: named entity recognition (NER), dependency parsing (DP), terminology alignment, and sentiment analysis. The ELMoGAN methods perform very well on the NER and terminology alignment tasks, with a lower cross-lingual loss for NER compared to the direct training on some languages. In DP and sentiment analysis, linear contextual alignment variants are more successful. Comment: 30 pages, 5 figures
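The idea of a linear mapping between two embedding spaces, as described in the abstract above, can be illustrated with a toy sketch. It is not the Vecmap, MUSE, or ELMoGAN implementation: it assumes 2-dimensional vectors and restricts the map to a pure rotation, for which the least-squares angle has a closed form. The function names and the anchor pairs are hypothetical.

```python
import math

def fit_rotation(source_points, target_points):
    """Return the angle (radians) of the 2D rotation that best maps
    source anchor vectors onto their paired target anchor vectors
    in the least-squares sense."""
    dot_sum = 0.0
    cross_sum = 0.0
    for (x1, x2), (y1, y2) in zip(source_points, target_points):
        dot_sum += x1 * y1 + x2 * y2      # alignment already present
        cross_sum += x1 * y2 - x2 * y1    # signed misalignment
    return math.atan2(cross_sum, dot_sum)

def apply_rotation(theta, point):
    """Rotate a 2D point counter-clockwise by theta."""
    x1, x2 = point
    return (x1 * math.cos(theta) - x2 * math.sin(theta),
            x1 * math.sin(theta) + x2 * math.cos(theta))

# Hypothetical anchor pairs: the "same word in the same context"
# embedded in the source space and in the target space.
source_anchors = [(1.0, 0.0), (0.0, 1.0)]
target_anchors = [(0.0, 1.0), (-1.0, 0.0)]
theta = fit_rotation(source_anchors, target_anchors)
mapped = apply_rotation(theta, (1.0, 0.0))
```

Real contextual embeddings have hundreds of dimensions, so a practical linear method would fit a full matrix rather than a single angle. The sketch only shows why paired anchor points are needed: the map is estimated entirely from them.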

arXiv.org e-Print Archive

Computer Speech Recognition in Slovene Language

Author: Ulčar Matej
Publication date: 03/10/2018

Manual transcription of speech is slow and is being replaced by automatic speech recognition systems. These systems are also used for voice control of various programs and devices. In this thesis, we used GMM-HMM methods for the acoustic model and n-grams for the language model as a baseline for Slovene speech recognition. We improved both models with deep neural networks, which have proven to be very successful. We tested several architectures of time-delayed neural networks and neural networks with long short-term memory for both the acoustic and the language model. We used a large lexicon containing about a million words. Time-delayed neural networks achieved the best results on continuous speech, with 72.84% of words correctly identified.
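A figure such as the share of correctly identified words is usually derived from a word-level alignment between a reference transcript and the recognizer output. The sketch below is one common way to compute it and is an assumption, since the abstract does not state the exact metric: it reports 1 minus the word error rate, using Levenshtein distance over words. The example sentences are invented.

```python
def edit_distance(reference_words, hypothesis_words):
    """Levenshtein distance between two word sequences."""
    previous = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]

def word_accuracy(reference, hypothesis):
    """Return 1 - WER for two whitespace-tokenized transcripts."""
    reference_words = reference.split()
    hypothesis_words = hypothesis.split()
    if not reference_words:
        raise ValueError("reference transcript must not be empty")
    errors = edit_distance(reference_words, hypothesis_words)
    return 1.0 - errors / len(reference_words)

# Invented example: one substituted word out of four.
accuracy = word_accuracy("the cat sat down", "the cat sat town")
```

Note that this measure can drop below zero when the output contains many inserted words. A variant that ignores insertions would give a different number for the same transcripts.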

Repository of the University of Ljubljana

ePrints.FRI

FinEst BERT and CroSloEngual BERT: less is more in multilingual models

Author: Robnik-Šikonja Marko
Ulčar Matej
Publication date: 14/06/2020

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. However, the research has mostly focused on the English language. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks (NER, POS-tagging, and dependency parsing), using the multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations. Comment: 10 pages, accepted at TSD 2020 conference

arXiv.org e-Print Archive

Slovene and Croatian word embeddings in terms of gender occupational analogies

Author: Anka Supej
Marko Robnik-Šikonja
Matej Ulčar
Senja Pollak
Publication venue: University of Ljubljana
Publication date: 01/07/2021

In recent years, the use of deep neural networks and dense vector embeddings for text representation has led to excellent results in the field of computational understanding of natural language. It has also been shown that word embeddings often capture gender, racial and other types of bias. The article focuses on evaluating Slovene and Croatian word embeddings in terms of gender bias using word analogy calculations. We compiled a list of masculine and feminine nouns for occupations in Slovene and evaluated the gender bias of fastText, word2vec and ELMo embeddings with different configurations and different approaches to analogy calculations. The lowest occupational gender bias was observed with the fastText embeddings. Similarly, we compared different fastText embeddings on Croatian occupational analogies.
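A word analogy calculation of the kind mentioned above can be sketched as follows. This is a minimal illustration of one standard approach (vector offset with cosine similarity), not necessarily any of the variants the article compares. The 4-dimensional vectors are invented for the example and do not come from fastText, word2vec, or ELMo.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by finding the word whose
    vector is closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))

# Invented toy vectors; dimensions loosely mean
# (masculine, feminine, teaching, medicine).
toy_vectors = {
    "moški":      [1.0, 0.0, 0.0, 0.0],  # man
    "ženska":     [0.0, 1.0, 0.0, 0.0],  # woman
    "učitelj":    [1.0, 0.0, 1.0, 0.0],  # teacher (masculine)
    "učiteljica": [0.0, 1.0, 1.0, 0.0],  # teacher (feminine)
    "zdravnik":   [1.0, 0.0, 0.0, 1.0],  # doctor (masculine)
    "zdravnica":  [0.0, 1.0, 0.0, 1.0],  # doctor (feminine)
}
answer = solve_analogy(toy_vectors, "moški", "ženska", "zdravnik")
```

With these idealized vectors the analogy resolves to the feminine form of the same occupation. In a biased embedding space, the nearest word could instead be a different, stereotypically associated occupation, which is the effect such an evaluation looks for.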

Directory of Open Access Journals

ELMo embeddings models for seven languages

Author: Ulčar Matej
Publication venue: University of Ljubljana
Publication date: 25/11/2019

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on large monolingual corpora for 7 languages: Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish. Each language's model was trained for approximately 10 epochs. The sizes of the training corpora range from over 270 M tokens in Latvian to almost 2 B tokens in Croatian. For each language model, about 1 million of the most common tokens were provided as vocabulary during training. The model can also infer OOV words, since the neural network input is on the character level. Each model is in its own .tar.gz archive, consisting of two files: PyTorch weights (.hdf5) and options (.json). Both are needed for model inference with the allennlp Python library (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md).
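Since each archive is described as holding exactly one weights file and one options file, unpacking can be sketched with the Python standard library alone. The function name and the archive and directory names are hypothetical, and the sketch assumes the two files are identified by their .hdf5 and .json extensions.

```python
import tarfile
from pathlib import Path

def extract_elmo_archive(archive_path, target_dir):
    """Unpack a .tar.gz model archive and return the paths of its
    options (.json) and weights (.hdf5) files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Use the safe extraction filter where the running Python offers it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target, **extra)
    options_files = sorted(target.rglob("*.json"))
    weights_files = sorted(target.rglob("*.hdf5"))
    if len(options_files) != 1 or len(weights_files) != 1:
        raise FileNotFoundError(
            "expected exactly one .json and one .hdf5 file in the archive")
    return options_files[0], weights_files[0]

# Hypothetical usage; the archive name is a placeholder:
# options_path, weights_path = extract_elmo_archive("model.tar.gz", "elmo_model")
```

The two returned paths are the options and weights files that an ELMo loader, such as the one in the allennlp tutorial linked above, takes as input.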

Common Language Resources and Technology Infrastructure - Slovenia

ELMo embeddings model, Slovenian

Author: Ulčar Matej
Publication venue: University of Ljubljana
Publication date: 15/10/2019

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on the entire Gigafida 2.0 corpus (https://viri.cjvt.si/gigafida/System/Impressum) for 10 epochs. The 1,364,064 most common tokens were provided as vocabulary during the training. The model can also infer OOV words, since the neural network input is on the character level.
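Why character-level input lets the model handle out-of-vocabulary (OOV) words can be shown with a simplified encoding sketch. It is not the bilm-tf code: the maximum length and the three special ids are assumptions chosen for illustration. The point is that any word, seen in training or not, becomes a fixed-length sequence of byte ids.

```python
MAX_WORD_LENGTH = 50   # assumed fixed width per word
BEGIN_WORD = 256       # assumed special ids, outside the byte range 0..255
END_WORD = 257
PADDING = 258

def word_to_char_ids(word, max_length=MAX_WORD_LENGTH):
    """Encode a word as UTF-8 byte ids wrapped in begin/end markers
    and padded to a fixed length. No vocabulary lookup is involved."""
    byte_ids = list(word.encode("utf-8"))[: max_length - 2]
    ids = [BEGIN_WORD] + byte_ids + [END_WORD]
    return ids + [PADDING] * (max_length - len(ids))

# A made-up word gets a valid encoding just like a frequent one.
frequent_word_ids = word_to_char_ids("in")
unseen_word_ids = word_to_char_ids("kvazibeseda")
```

Non-ASCII letters such as "č" simply occupy two byte positions, so no separate character inventory per language is needed in this scheme.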

Common Language Resources and Technology Infrastructure - Slovenia

Computer Speech Recognition in Slovene Language

Author: Ulčar Matej
Publication date: 12/10/2018

Manual transcription of speech is slow and is being replaced by automatic speech recognition systems. These systems are also used for voice control of various programs and devices. In this master's thesis, we used established GMM-HMM methods for the acoustic model and n-grams for the language model as a baseline for Slovene speech recognition. We improved both models with deep neural networks, which have proven to be very successful. We tested several architectures of time-delayed neural networks and neural networks with long short-term memory for both the acoustic and the language model. We trained the recognizer on a large lexicon containing about a million different words. Time-delayed neural networks achieved the best results on continuous speech, with 72.84% of words correctly identified.

Repository of the University of Ljubljana

Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0

Author: Robnik-Šikonja Marko
Ulčar Matej
Publication venue: University of Ljubljana
Publication date: 29/12/2020

The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used in training a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training the model have 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and training the model are available at https://github.com/clarinsi/Slovene-BERT-Tool. The model released here is a PyTorch neural network model, intended for use with the transformers library (https://github.com/huggingface/transformers).

Common Language Resources and Technology Infrastructure - Slovenia

CroSloEngual BERT 1.1

Author: Robnik-Šikonja Marko
Ulčar Matej
Publication venue: University of Ljubljana
Publication date: 09/07/2020

Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. A state-of-the-art tool representing words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT is distributed as neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library). Changes in version 1.1: fixed the vocab.txt file, as the previous version had an error causing very bad results during fine-tuning and/or evaluation.

Common Language Resources and Technology Infrastructure - Slovenia